Abstract


Moving software development into the engineering arena requires controllability, and a process
can be controlled only if it can be measured. Measuring the process does no good if the product is
not also measured; being the best at producing an inferior product does not define a quality process.
Also, not every number extracted from software development is a valid measurement. A valid 
measurement only results when we are able to verify that the number is representative of the attribute 
that we wish to measure. Many proposed software metrics are used by practitioners without these 
metrics ever having been validated, leading to costly but often useless calculations. Several 
researchers have bemoaned the lack of scientific precision in much of the published software 
measurement work and have called for validation of software metrics by measurement theory. This 
dissertation applies measurement theory to validate fifty proposed object-oriented software metrics 
(see Li and Henry, 1993; Chidamber and Kemerer, 1994; Lorenz and Kidd, 1994).


I. Background and Objectives 


The need for software metrics 

Software development historically has been the arena of the artist. Artistically developed code 
often resulted in arcane algorithms or spaghetti code that was unintelligible to those who had to 
perform maintenance. Initially only very primitive measures such as lines of code (LOC) and 
development time per stage of the development life cycle were collected. Projects often ran over 
estimated time and over budget. In the pursuit of greater productivity, software development evolved 
into software engineering. Part of the software engineering concept is the idea that the product 
should be controllable. DeMarco [1982] reminds us that what is not measured cannot be controlled. 

Measurement is the process whereby numbers or symbols are assigned to attributes of entities
in such a manner as to describe the attribute in a meaningful way. We cannot take measurements and
then apply them to just any attributes. Unfortunately, this is exactly what the software development
community has been doing [Fenton, 1994].

Because people observe things differently (and often intuitively feel differently about things), 
a model is usually defined for the entities and attributes to be measured. The model requires everyone 
to look at the subject from the same viewpoint. Fenton [1994] uses the example of human height. 
Should posture be taken into consideration when measuring human height? Should shoes be allowed? 
Should we measure to the top of the head or the top of the hair? The model forces a reasonable 
consensus upon the measurers. 

As has already been stated, control of a process or product requires that the process or
product be measurable; therefore, control of software requires software measures [Baker, et al.,
1990]. It does no good to measure the process if the product is not measured. Being the best at
producing an inferior product does not define a quality process.

The need for metric validation 

Choosing metrics becomes a horse and cart or a chicken and egg type of question. Which do 
we do first, choose the metrics of interest or validate the metrics? Since these metrics are already in 
use, I have chosen to validate them first. The next step will be to choose from among the measures 
(valid metrics) a suite of them that is the smallest set of measures that is both necessary and sufficient 
to measure the important dimensions of the software. The steps involved are:

1. Identify important dimensions of the software.

2. Classify measures by the dimension(s) they measure.

3. Use multivariate statistical methods to investigate the parallelism/orthogonality of the
captured measures.

It is not beneficial to measure the same dimension of an object by more than one method. Each 
method will have its own degree of accuracy and its own cost of application. Once the necessary 
degree of accuracy has been established, the most cost-effective method that delivers that level of
accuracy should be the measurement of choice. When models are built with unvalidated metrics, the
degree of accuracy cannot be known.
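
As a rough illustration of step 3, the sketch below flags pairs of metrics that order a set of classes in almost the same way and therefore probably capture the same dimension. It is only a stand-in for the multivariate methods mentioned above: it assumes pairwise Spearman rank correlation (chosen because it needs no more than an ordinal scale), and the metric names, the scores, and the 0.9 threshold are all hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def redundant_pairs(scores, threshold=0.9):
    """List metric pairs whose rank correlation magnitude reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(scores), 2)
        if abs(spearman(scores[a], scores[b])) >= threshold
    ]


# Hypothetical scores of five classes on three metrics (illustrative data only).
scores = {
    "metric_a": [3, 7, 1, 9, 5],
    "metric_b": [30, 70, 10, 90, 50],  # orders the classes exactly like metric_a
    "metric_c": [2, 1, 5, 3, 4],
}

print(redundant_pairs(scores))
```

With these made-up scores only metric_a and metric_b are flagged, so under the cost argument above only the cheaper of the two would need to be collected.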

Fenton [1994] argued that much of the software measurement work published to date is 
scientifically flawed. Fenton is not the only scientist who has observed this lack of scientific precision. 
Baker, et al., [1990] said as much when they wrote that research in software metrics often is suspect 
because of a lack of theoretical rigor. Li and Henry [1993a] argued that validation is necessary for
the effective use of software metrics. Schneidewind [1992] stated that metrics must be validated to
determine whether they measure what it is they are alleged to measure. Weyuker [1988] stated that 
existing and proposed software measures must be subjected to an explicit and formal analysis to 
define the soundness of their properties. 

McCabe failed to validate his complexity metric. Gilb referenced empirical testing as his
source of verification and validation, i.e., there was no theoretical validation of Gilb's metrics.
Halstead's equations were tested statistically. McCall defined metrics based on heuristics. A metric
was accepted by McCall if a chosen sample fell within a 90% confidence interval [McCall, et al.,
1977]. DeMarco employed no theoretical base in the validation of his metrics. Li and Henry [1993]
used statistical analysis to validate the prediction of maintenance effort by the group of metrics that
they published. No theoretical validation was attempted by Li and Henry. Chidamber and Kemerer
mentioned measurement theory in their evaluation of each metric but made no attempt to assign a
scale to the metrics (see the paragraph on scales in section II for an explanation of the importance of
scale to the valid interpretation of a measurement). Lorenz and Kidd [1994] used only heuristics to
validate their metrics.

Software metrics and measurement theory 

Measurement theory was first used in software metric research to validate the myriad 
complexity metrics which dominated the early research in the field. Correlations were expected to 
exist between the complexity of a project and the achievement of acceptable parameters in its 
development. This was the rationale for the interest in software complexity and the development of 
metrics to measure this complexity [Anderson, 1992].

When defining a measure, one must first designate precisely the attribute to be measured, e.g.,
the height of humans. Then a model is specified that captures the attribute, e.g., stand up straight,
take off your shoes, do not include hair height in the measurement. The congruence that comes from
the model must represent the attribute being measured, i.e., the intuitive order of the objects, with
respect to the attribute being measured, must be preserved by the model. Finally, an order-preserving
map from the model to a number system is defined, e.g., if we observe that Harry is taller than Dick,
any measurement that we take of their height must result in numbers or symbols that preserve this
relationship [Baker, et al., 1990].

Before a model can be proposed, it must be known what is being measured. This basic 
measurement principle has been ignored in much of the software metric work of record. It is 
fundamental to measurement theory that the measurer have an intuitive understanding, usually based 
on observation, of the attribute being measured [Fenton, 1991].

The object-oriented paradigm 

An object combines both data structure and behavior in a single entity. Object-oriented 
software is organized as a collection of explicit objects. By contrast, data structure and behavior are 
loosely connected in traditional programming [Rumbaugh, et al., 1991]. Authors have not been in
agreement about the characteristics that identify the object-oriented approach. Henderson-Sellers 
[1991] listed information hiding, encapsulation, objects, classification, classes, abstraction, 
inheritance, polymorphism, dynamic binding, persistence, and composition as having been chosen by 
at least one author as a defining aspect of object-orientation. Rumbaugh, et al. [1991] added identity.
Smith [1991] added angle type and Sully [199?] added the unit building block to this list of defining
aspects.

The older software metrics do not take these new concepts into consideration. These
characteristics therefore call for new metrics to measure object-oriented software. The
recent explosion of object-oriented software metrics (Li and Henry, 1993; Chidamber and Kemerer,
1994; and Lorenz and Kidd, 1994) has arrived with little validation beyond regression analysis
of observed behavior.

Research objectives

"Validation of a software measure is the process of ensuring that the measure is a proper
numerical characterization of the claimed attribute" [Baker, et al., 1990]. Fenton [1991] described
two meanings of validation. Validation in the narrow sense is the rigorous measurement of the
physical attributes of the software. Validation in the wide sense determines the accuracy of any
prediction system using the physical attributes of the software. Accurate prediction is possibly the
most valuable outcome to be gained from software measurement. Prediction systems are validated
by empirical experiments. Accurate prediction relies on careful measurement of the predictive
attributes and careful observation of the dependent attributes. A model which accurately measures
the attributes is necessary but not sufficient for building an accurate prediction system [Fenton, 1994].

In the past, validation in the wide sense has been conducted without first carrying out 
validation in the narrow sense. In this dissertation we intend to validate in the narrow sense the
object-oriented software metrics that have appeared in the literature. This is a necessary step before
these metrics can be used to predict such managerial concerns as cost, reliability, and productivity.
Fenton [1991] states: "Good predictive theories only follow once we have rigorous
measures of specific well understood attributes."

II. Research Approach and Methodology 


Introduction 

There are two fundamental problems in measurement theory. The first is the representation
problem. The representation problem is to find sufficient conditions for the existence of a mapping
from an observed system to a given mathematical system. Another aspect of the representation
problem is pointed out by Weyuker [1988]: How unique is the result of the measurement? A
measurement system must provide results that enable us to distinguish one class of object from
another class of object.

The other fundamental problem of measurement theory is the uniqueness problem. 
Uniqueness theorems define the properties and valid operations of different measurement systems 
and tell us what type of scale results from the measurement system. A uniqueness theorem
contributes to a theory of scales which says that the scale used dictates the meaningfulness of
statements made about measures based on the scale [Hong, et al., 1993; Roberts, 1979]. A
statement involving numerical scales is meaningful if the truth of the statement is maintained when
the scale involved is replaced by another (admissible) scale.



The empirical/formal relational system. A relational system is a way of relating one entity (or 
one event) of a set to another entity (or event) of the same set. In the physical sciences the relations 
take the form "longer than," "heavier than," "of equal volume," etc. In the social sciences (and thus in
software metric measurement) the relations take the form "is preferred to," "is not preferred to," "is at
least as good as."

Definition 2.1: The ordinal relational system is an ordered tuple $(A, R_1, \ldots, R_n)$ where
$A$ is a nonempty set of objects and the $R_i$, $i = 1, \ldots, n$, are k-ary relations on $A$
[Zuse, 1990].

The extensive structure. The extensive structure is an expansion of the ordinal relational system to
include binary operations on the objects of the set. The extensive structure is required to measure
objects on the interval or ratio scales. The binary operation in the empirical relational system
usually is designated concatenation, denoted by $\circ$. The usual manifestation of the binary operation
in the formal relational system is addition (+), although multiplication may be the proper operation
under some circumstances.

Definition 2.2: The extensive relational system is an ordered tuple
$(A, R_1, \ldots, R_n, \circ_1, \ldots, \circ_m)$ where $A$ is a nonempty set of objects, the $R_i$,
$i = 1, \ldots, n$, are k-ary relations on $A$, and the $\circ_j$, $j = 1, \ldots, m$, are closed binary
operations [Zuse, 1990].

Homomorphism. A software measurement can be a homomorphism only if the meaning and
interpretation of the empirical relationship is clear [Zuse, 1990]. Let $\succ$ denote "is larger than" (or "is
preferred to"). Given the empirical scale $\mathcal{A} = (A, \succ, \circ)$, which we wish to measure using the real
numbers, we must map $\mathcal{A}$ to $\mathcal{B} = (B, >, +)$ while preserving the relation $\succ$ and the operation
$\circ$, i.e., $M: A \to B$ is a valid mapping from $A$ to $B$ iff $a_1 \succ a_2 \Leftrightarrow b_1 > b_2$, where
$b_i = M(a_i)$. In order to know whether or not the relation and the operation have been preserved, the
meaning and interpretation of $\mathcal{A}$, $A$, $\succ$, and $\circ$ must be precisely defined.
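
The two preservation conditions can be checked mechanically on a finite sample of objects. The sketch below is illustrative only: the toy "programs" (tuples of statement labels), the observed ordering, and both candidate measures are hypothetical, and passing on a sample does not prove that a measure is a homomorphism in general.

```python
from itertools import product


def preserves_order(objects, larger_than, measure):
    """True iff a1 ≻ a2 <=> M(a1) > M(a2) for every ordered pair of objects."""
    return all(
        larger_than(a1, a2) == (measure(a1) > measure(a2))
        for a1, a2 in product(objects, repeat=2)
    )


def preserves_concatenation(objects, concat, measure):
    """True iff M(a1 ∘ a2) == M(a1) + M(a2) for every ordered pair of objects."""
    return all(
        measure(concat(a1, a2)) == measure(a1) + measure(a2)
        for a1, a2 in product(objects, repeat=2)
    )


# Hypothetical empirical system: three toy programs, an ordering that observers
# are assumed to have agreed on, and concatenation as "run one after the other".
p1 = ("read",)
p2 = ("read", "compute")
p3 = ("read", "compute", "write")
programs = [p1, p2, p3]
observed_larger = {(p2, p1), (p3, p1), (p3, p2)}


def larger_than(a, b):
    return (a, b) in observed_larger


def concat(a, b):
    return a + b


def statement_count(program):
    """Candidate measure 1: number of statements."""
    return len(program)


def distinct_statement_count(program):
    """Candidate measure 2: number of distinct statements."""
    return len(set(program))


for m in (statement_count, distinct_statement_count):
    print(m.__name__,
          preserves_order(programs, larger_than, m),
          preserves_concatenation(programs, concat, m))
```

Both candidate measures preserve the observed order, but only the plain statement count is additive under this concatenation; counting distinct statements breaks as soon as a program is concatenated with itself.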

The weak order. Suppose you must select from a list of alternatives. For each pair of alternatives
$a_1$ and $a_2$, you prefer $a_1$ to $a_2$, you are indifferent between $a_1$ and $a_2$, or you prefer $a_2$ to $a_1$. If you
always prefer $a_1$ to $a_2$, you are said to have a strict preference. If, however, you sometimes prefer
$a_1$ to $a_2$ and sometimes you are indifferent between $a_1$ and $a_2$, you are said to have a weak
preference. When you have a weak preference and the measurements exhibit the axioms of
completeness, reflexiveness, and transitivity, the alternatives are said to constitute a weak order.
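
For a finite set of alternatives the three axioms can be tested by brute force. The following is a minimal sketch, assuming the preference is supplied as a function `at_least(a, b)` meaning "a is preferred to b or the two are indifferent"; the function names and both example relations are hypothetical.

```python
from itertools import product


def is_weak_order(objects, at_least):
    """Check reflexiveness, completeness, and transitivity on a finite set."""
    reflexive = all(at_least(a, a) for a in objects)
    complete = all(
        at_least(a, b) or at_least(b, a)
        for a, b in product(objects, repeat=2)
    )
    transitive = all(
        at_least(a, c)
        for a, b, c in product(objects, repeat=3)
        if at_least(a, b) and at_least(b, c)
    )
    return reflexive and complete and transitive


alternatives = ["x", "y", "z"]

# A preference induced by a numeric score is always a weak order.
score = {"x": 3, "y": 3, "z": 1}


def by_score(a, b):
    return score[a] >= score[b]


# A cyclic preference (x over y, y over z, z over x) violates transitivity.
cyclic_pairs = {("x", "x"), ("y", "y"), ("z", "z"),
                ("x", "y"), ("y", "z"), ("z", "x")}


def cyclic(a, b):
    return (a, b) in cyclic_pairs


print(is_weak_order(alternatives, by_score))  # the score-induced preference
print(is_weak_order(alternatives, cyclic))    # the cyclic preference
```

The cyclic example is the kind of empirical judgment that no numeric measure can represent, which is why the axioms have to be checked before a scale is claimed.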

Meaningfulness. When does it make sense to state: 

• Program A is more complex than program B?

• Program A is twice as complex as program B? 

• Program A is twice as maintainable as program B? 

• Program A displays more quality than program B? 

• The quality of program A was increased by 20%?

Following Zuse [1990], a statement is meaningful if and only if the truth of the statement holds
against all admissible transformations. Therefore, the meaningfulness of these statements depends
on the scale assignable to the metric used to measure the attribute in question.
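
A small numerical experiment makes the dependence on scale concrete. The sketch below assumes two hypothetical complexity scores and a handful of sample transformations; because it tries only a few transformations, it can show that a statement is not meaningful for a scale type, but it cannot prove that a statement is meaningful.

```python
def twice_as(m_a, m_b):
    """'Program A is twice as complex as program B.'"""
    return m_a == 2 * m_b


def more_than(m_a, m_b):
    """'Program A is more complex than program B.'"""
    return m_a > m_b


def survives(statement, m_a, m_b, transformations):
    """True iff the statement's truth value is unchanged by every sample transformation."""
    truth = statement(m_a, m_b)
    return all(statement(t(m_a), t(m_b)) == truth for t in transformations)


# Sample admissible transformations (integer coefficients keep the arithmetic exact).
similarity = [lambda m: 3 * m, lambda m: 10 * m]           # M' = aM, a > 0
positive_linear = [lambda m: 3 * m + 5, lambda m: m + 32]  # M' = aM + b, a > 0

# Hypothetical complexity scores for programs A and B.
score_a, score_b = 20, 10

print(survives(twice_as, score_a, score_b, similarity))        # ratio-scale transformations
print(survives(twice_as, score_a, score_b, positive_linear))   # interval-scale transformations
print(survives(more_than, score_a, score_b, positive_linear))  # ordering statement
```

The "twice as complex" statement keeps its truth value under the similarity transformations but loses it under the positive linear ones, so it should only be made about a metric that has been placed on the ratio scale. The plain ordering statement survives both.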



Table 1

Properties of Measurement Scales

Scale      Basic empirical operations              Admissible transformations
---------  --------------------------------------  ---------------------------------------
Ratio      =, <, >, equality of intervals,         M' = aM, a > 0
           and ratios                              (similarity transformation)
Interval   =, <, >, and equality of intervals      M' = aM + b, a > 0
                                                   (positive linear transformation)
Ordinal    =, <, and >                             M' = f(M), where f(M) is any monotonic
                                                   increasing transformation
Nominal    =                                       any one-to-one transformation


Scales. When groups of objects are measured on the nominal scale: many statistics cannot be used;
proportions can be taken; the mode is the only meaningful measure of centrality. When groups of
objects are measured on the ordinal scale: rank order statistics and non-parametric statistics can be
used (provided the necessary probability distribution can reasonably be assumed to be present);
the median is the most powerful meaningful measure of centrality. When groups of objects are
measured on the interval scale: parametric statistics as well as all that apply to ordinal scales can be
used (it must be reasonable to accept that the necessary probability distribution is present); the
arithmetic mean is the most powerful meaningful measure of centrality. When groups of objects are
measured on the ratio scale: percentage calculations as well as all statistics that apply to interval
scales can be used; the arithmetic mean is the most powerful meaningful measure of centrality.
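
The restriction on centrality measures can be seen with a small experiment. The sketch below assumes two hypothetical groups of scores that are only ordinal, so any monotonic increasing transformation (here the natural logarithm, chosen arbitrarily) is admissible. The comparison of medians keeps its outcome under the transformation; the comparison of means does not.

```python
import math
from statistics import mean, median

# Hypothetical ordinal scores for two groups of classes (illustrative data only).
group_a = [1, 2, 10]
group_b = [3, 4, 5]


def compare(center, xs, ys):
    """Return '>', '<' or '=' according to how center(xs) relates to center(ys)."""
    cx, cy = center(xs), center(ys)
    if cx > cy:
        return ">"
    if cx < cy:
        return "<"
    return "="


def transformed(f, xs):
    """Apply the transformation f to every score."""
    return [f(x) for x in xs]


log_a = transformed(math.log, group_a)
log_b = transformed(math.log, group_b)

print("mean:  ", compare(mean, group_a, group_b), "then", compare(mean, log_a, log_b))
print("median:", compare(median, group_a, group_b), "then", compare(median, log_a, log_b))
```

On the raw scores group A has the larger mean, yet after the transformation group B does, so a claim about means is not meaningful for ordinal data. The medians rank the two groups the same way before and after.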

Desirable properties of measures

Intuition. A measure should make sense based upon the professional experience of the measurer. 
Objects that appear better in the attribute being measured (based on the observer's experience) 
should score higher on the metric being used. Objects which appear similar should score roughly
the same.

Monotony. Monotony (or consistency) goes along with intuition. The measurement must be such 
that very nearly the same score is achieved regardless of the measurer. Also, the order that the 
objects appear in, in relation to each other, must be consistent from measurement to measurement. 

Mathematical foundation. It is important that the measure be grounded in mathematical theory. 
This foundation is necessary but not sufficient to make the metric an appropriate gauge of the 
property being measured. 

Understandability. The measurement process as well as the meaning of the metric should be 
understandable by interested persons [Tsai, et al., 1986 (as cited in Zuse, 1990)]. 

Variation. If all articles score the same on a metric, then that metric measures nothing. In order 
to measure a property there must be variation in measurement from object to object. 

Dispersion. A measure is not precise enough if all articles fall into only a few categories. Ideally,
the measure should be sensitive enough to measure the appropriate property on a continuum.
Especially grievous is the case that assigns the property to a set with discrete units of limited
cardinality [Weyuker, 1988 (as cited in Zuse, 1990)].

Before a model can be proposed, it must be known what is being measured. This basic 
measurement principle has been ignored in much of the software metric work of record. It is 
fundamental to measurement theory that the measurer have an intuitive understanding, usually based 
on observation, of the attribute being measured [Fenton, 1991]. 

The basis of the methodology to be followed will be Zuse's model. 

Zuse's model 

Before a metric can be said to possess scale, 1) enough atomic modifications must be defined 
to completely describe any changes that can affect the metric, 2) the partial properties of the metric 
must be ascertained, and 3) the intuition of the measurer must agree with the partial properties 
established. 

The concatenation operator for each metric must be defined based on the properties of the
metric. Since Zuse always evaluated static measures of software code, he used the sequential and
alternative structures of flowgraphs to define the concatenation operation.

Definition 2.3: A flowgraph $G = (E, N, s, t)$ is a directed graph with a finite, nonempty
set of nodes $N$, a finite, nonempty set of edges $E$, a start node $s \in N$, and a terminal
node $t \in N$. Each node $x \in N$ lies on some path in $G$ from $s$ to $t$ along the edges. An
edge is an ordered pair of nodes $(x, y)$ [Zuse, 1990].
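
Definition 2.3 translates directly into a small data structure with a validity check. The following is a minimal sketch; the class name, field names, and the example graph (a single if-then-else, not the graph of Figure 1) are assumptions made for illustration. "Lies on some path from s to t" is interpreted as: the node is reachable from s, and t is reachable from the node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flowgraph:
    edges: frozenset      # set of ordered pairs (x, y)
    nodes: frozenset
    start: object
    terminal: object


def reachable(sources, edges):
    """Return every node reachable from `sources` by following `edges`."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        x = stack.pop()
        for u, v in edges:
            if u == x and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_flowgraph(g):
    """Check the conditions of Definition 2.3 for a candidate graph."""
    if not g.nodes or not g.edges:
        return False
    if g.start not in g.nodes or g.terminal not in g.nodes:
        return False
    if any(x not in g.nodes or y not in g.nodes for x, y in g.edges):
        return False
    forward = reachable([g.start], g.edges)
    reversed_edges = {(y, x) for x, y in g.edges}
    backward = reachable([g.terminal], reversed_edges)
    return all(n in forward and n in backward for n in g.nodes)


# Hypothetical example: node 2 is a decision node with two branches.
if_then_else = Flowgraph(
    edges=frozenset({(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)}),
    nodes=frozenset({1, 2, 3, 4, 5}),
    start=1,
    terminal=5,
)

# Node 6 can reach the terminal node but cannot be reached from the start node.
with_stray_node = Flowgraph(
    edges=if_then_else.edges | {(6, 5)},
    nodes=if_then_else.nodes | {6},
    start=1,
    terminal=5,
)

print(is_flowgraph(if_then_else), is_flowgraph(with_stray_node))
```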

Figure 1 is a flowgraph. Nodes 3, 7, and 11 are called predicate (decision) nodes. Nodes
4, 5, 8, and 9 are called processing nodes. An atomic modification to a flowgraph is defined as
adding, deleting, or transferring edges or nodes in the flowgraph [Zuse, 1990]. Specifically, we
define:

AM1 as adding (deleting) an edge at an arbitrary location,

AM2 as adding (deleting) a node and an edge at an arbitrary location, and

AM3 as transferring an edge from one location in a flowgraph to another location.

Every metric increases, decreases, or remains the same in reaction to each of these atomic
modifications. The partial property of the metric is defined as the sensitivity of the metric to an
atomic modification, i.e., the measure M has the partial property <, =, or > (respectively: it is less
desirable, you have indifference, or it is more desirable) with respect to the atomic modification AM.
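
The reaction of a metric to each atomic modification can be tabulated by applying the modification to a small graph and comparing the values before and after. The sketch below uses the cyclomatic number |E| - |N| + 2 as the example measure; that choice, the example graphs, and the function names are assumptions for illustration, and the sketch reports only the direction of change, not whether the change is desirable.

```python
def cyclomatic(nodes, edges):
    """Cyclomatic number of a single connected flowgraph: |E| - |N| + 2."""
    return len(edges) - len(nodes) + 2


def reaction(measure, before, after):
    """Describe how the measure reacts to a modification of the graph."""
    m_before, m_after = measure(*before), measure(*after)
    if m_after > m_before:
        return "increases"
    if m_after < m_before:
        return "decreases"
    return "unchanged"


# Hypothetical base flowgraph: a straight sequence 1 -> 2 -> 3 -> 4.
base = ({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4)})

# AM1: add an edge (2, 4).
am1 = ({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4), (2, 4)})

# AM2: add a node and an edge by splitting edge (3, 4) into (3, 5) and (5, 4).
am2 = ({1, 2, 3, 4, 5}, {(1, 2), (2, 3), (3, 5), (5, 4)})

# AM3: starting from the AM1 graph, transfer edge (2, 4) to location (1, 3).
am3 = ({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4), (1, 3)})

print("AM1:", reaction(cyclomatic, base, am1))
print("AM2:", reaction(cyclomatic, base, am2))
print("AM3:", reaction(cyclomatic, am1, am3))
```

For this measure only AM1 changes the value, because AM2 adds one node and one edge and AM3 leaves both counts untouched. Whether each reaction agrees with the measurer's intuition is a separate question that the code cannot answer.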

A measure can be placed on the ordinal scale if the user accepts the partial properties of the
atomic modifications defined for that measure and the axioms of the weak order (completeness,
reflexiveness, and transitivity) hold. A measure can be used as an interval scale if all conditions of
the ordinal scale are met and the distance defined on the interval is consistent for all intervals. A
measure can be placed on the ratio scale if all conditions of the ordinal scale are met and the user
accepts the binary concatenation operation(s) defined on the measure.



Fig. 1. A flowgraph

Let us now consider Zuse's methodology more specifically.

Description of the measures. The original definition (as provided by the author of the measure) 
is given for each metric. Each metric is then defined using a uniform method. The flowgraphs of 
Zuse will be used whenever static code is being measured. Other, appropriate, structures will be 
defined as needed for each metric being validated. 

Examples of the calculation of the measures. Simple and uniform examples are given for each 
metric. 

Partial property description of the measures. Atomic modifications are used on each metric to 
describe its partial properties. Atomic modifications to flowgraphs consist of adding, deleting, and 
moving edges and nodes. Other atomic modifications will be developed as necessary for other 
structures. 

Complete description of the measures as an ordinal scale. Atomic modifications are defined
sufficient to describe the criteria for the use of the metric as an ordinal scale; then the measures are
examined to determine if the axioms of the weak order (completeness, reflexiveness, and
transitivity) hold.

Consideration of the measures as an interval scale. The mappings that result from the atomic
modifications are compared to determine if a uniform difference between integer results can be
discerned.

Extensive structure and ratio scale. Binary concatenation operations to flowgraphs consist of 
sequential and alternative addition of two flowgraphs. When it is necessary to define another 
structure, other binary concatenation operations must also be defined. The ways the metrics respond 
to the binary concatenation operations, as defined, are investigated to determine whether or not the 
metric possesses the properties of the extensive structure. The rules are given for the use of the 
metrics as a ratio scale. 
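
How a metric responds to concatenation can be explored on small examples. The sketch below implements only the sequential case, and it assumes one particular construction (an added edge from the terminal node of the first flowgraph to the start node of the second), which the text above does not specify. The graph representation, the two example measures, and all names are hypothetical.

```python
def sequential(g1, g2):
    """Sequentially concatenate two flowgraphs given as (nodes, edges, start, terminal).

    Assumed construction: keep both graphs and add one edge from the terminal
    node of g1 to the start node of g2. Node labels must be disjoint.
    """
    nodes1, edges1, start1, terminal1 = g1
    nodes2, edges2, start2, terminal2 = g2
    if nodes1 & nodes2:
        raise ValueError("node labels of the two flowgraphs must be disjoint")
    return (nodes1 | nodes2,
            edges1 | edges2 | {(terminal1, start2)},
            start1,
            terminal2)


def node_count(g):
    """Example measure 1: number of nodes."""
    return len(g[0])


def cyclomatic(g):
    """Example measure 2: cyclomatic number |E| - |N| + 2."""
    return len(g[1]) - len(g[0]) + 2


def is_additive(measure, g1, g2, concat):
    """True iff M(g1 ∘ g2) == M(g1) + M(g2) for this particular pair."""
    return measure(concat(g1, g2)) == measure(g1) + measure(g2)


# Hypothetical flowgraphs: one with a single decision, one with a single edge.
branching = ({"a1", "a2", "a3"},
             {("a1", "a2"), ("a1", "a3"), ("a2", "a3")},
             "a1", "a3")
straight = ({"b1", "b2"}, {("b1", "b2")}, "b1", "b2")

print(is_additive(node_count, branching, straight, sequential))
print(is_additive(cyclomatic, branching, straight, sequential))
print(cyclomatic(branching), cyclomatic(straight),
      cyclomatic(sequential(branching, straight)))
```

Under this construction the node count adds up exactly, while the cyclomatic number of the concatenation is one less than the sum of the parts. A single example of this kind can rule out plain additivity for a given operation; it cannot by itself establish the extensive structure.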

Metric summary. The properties of the metric are summarized and compared to the properties of 
similar metrics. 

The seven steps of Zuse's model are applied to each metric to determine what meaningful
statements may be made using the information gleaned from the metric.

III. Expected Contribution

Contribution and significance of this study 

Many object-oriented metrics are being proposed. Because they have not been validated 
using measurement theory, it is not clear that these metrics are valid measures of the attributes that 
they claim to measure. Some of these metrics are touted as predictive without being rigorously 
defined. This study looks at each of the object-oriented metrics and scrutinizes them for validity in
the narrow sense of Fenton [1991].

Does the metric measure what its author proposes to measure? If not, what can be said about 
the metric in terms of what is being measured? Is there another metric which does measure the 
desired attribute? Are the statistics used with the metric valid considering the scale attributed to the 
metric? Is the measurement an assessment measurement or meant to be a predictive measurement? 
Does the metric hold up under vigorous scrutiny of the conditions of representation and uniqueness? 
Do intuitive and empirical understandings survive under all allowable transformations? 

The answers to these questions are pertinent to the valid use of these metrics. Since the 
collection of data for the calculation of metrics is very expensive [Deutsch and Willis, 1988], this 
study will help the practitioner by separating those object-oriented metrics that are not worth the cost 
of calculation from those that are and by differentiating those metrics that are valid for assessment 
purposes from those that are valid for use in prediction systems.

Additionally, the software engineering community should gain insight into further use of the 
metrics, other metrics which might replace them, the valid statistics that each metric supports, and 
future research that needs to be carried out. 